Bayes Optimal Decision Rule and Types of Classification Models

Accompanying Notebook for my presentation in the seminar Applied Statistics, Department of Mathematics and Natural Sciences, University of Kassel



in progress.


Introduction

This Notebook complements the presentation by

  • providing the proofs discussed, and

  • implementing Gaussian Naive Bayes in R.


Topics

  1. Bayes Optimal Decision Rule

  2. Generative vs. discriminative classification models

  3. Introduction of Naive Bayes Classifier with Gaussian distribution as an example of a generative classification model


Orientation Map


This is my own representation and reflects my learning outcomes based on the given literature and with regard to the seminar topic. It is not intended to be exhaustive. For example, there are more generative models than Naive Bayes, and the indicator loss does not necessarily lead to logistic regression in a discriminative learning context.


Key Concepts

Marginalization

For a pair of random variables (X,Y), the marginal density f_X(x) of X is obtained from the joint pdf f_{X,Y}(x,y) by integrating out the other variable:

f_X(x)=\int f_{X,Y}(x,y) \, \mathrm{d}y.

Thus, if the joint density is known, the information about the individual variables can be recovered (Kroese et al., 2024, p. 427). This operation is called marginalization.
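Marginalization can be illustrated numerically. The R sketch below (a bivariate normal with unit variances and an illustrative correlation of 0.5) integrates the joint density over y and compares the result with the known standard normal marginal:

```r
# Joint pdf of a bivariate normal with standard normal marginals
# and correlation rho (illustrative choice).
joint <- function(x, y, rho = 0.5) {
  exp(-(x^2 - 2 * rho * x * y + y^2) / (2 * (1 - rho^2))) /
    (2 * pi * sqrt(1 - rho^2))
}

# Marginalize: integrate the joint density over y at a fixed x.
x0 <- 1
fx <- integrate(function(y) joint(x0, y), -Inf, Inf)$value

# The result matches the standard normal density at x0.
c(numeric = fx, exact = dnorm(x0))
```

The same computation works for any fixed x0, since integrating out y always recovers f_X.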

Conditional Density

Let X and Y be random variables with joint pdf f_{X,Y}(x,y), and assume f_X(x)>0. Then the conditional pdf of Y given X=x is

f_{Y \vert X}(y \vert x)= \frac{f_{X,Y}(x,y)}{f_X(x)}

(Kroese et al., 2024, p. 431).

Conditional Expectation

In the continuous case, the conditional expectation of a random variable Y given X=x is defined as

\mathbb{E}[Y \vert X=x]=\int y \;f_{Y \vert X}(y \vert x) \, \mathrm{d}y

(Kroese et al., 2024, p. 431). Replace the integral with a sum for the discrete case. “Note that \mathbb{E}[Y \vert X=x] is a function of x. The corresponding random variable is written as \mathbb{E}[Y \vert X]” (Kroese et al., 2024, p. 431).


Proofs

Tower Property

\begin{align*} &\text{Claim:} &&\mathbb{E}[\mathbb{E}[Y \vert X]] &&=\mathbb{E}[Y] \\[12pt] &\text{Proof:} &&\mathbb{E}[Y \vert X=x] &&= \int_{\mathbb{R}} y \, f_{Y \vert X}(y \vert x) \, \mathrm{d}y \\[4pt] & && &&=\int_{\mathbb{R}} y \, \frac{f_{X,Y}(x,y)}{f_X(x)} \, \mathrm{d}y \\[4pt] &\Rightarrow &&\mathbb{E}[\mathbb{E}[Y \vert X]] &&=\int_{\mathbb{R}} \int_{\mathbb{R}} y \, \frac{f_{X,Y}(x,y)}{f_X(x)} \, \mathrm{d}y \, f_X(x) \, \mathrm{d}x \\[4pt] & && &&=\int_{\mathbb{R}} \int_{\mathbb{R}} y \, f_{X,Y}(x,y) \, \mathrm{d}y \, \mathrm{d}x \\[4pt] & && &&=\int_{\mathbb{R}} y \int_{\mathbb{R}} f_{X,Y}(x,y) \, \mathrm{d}x \, \mathrm{d}y \\[4pt] & && &&=\int_{\mathbb{R}} y \, f_Y(y) \, \mathrm{d}y \\[4pt] & && &&=\mathbb{E}[Y] &&& \Box \end{align*}
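The tower property can also be checked by simulation. A short R sketch with an illustrative hierarchical model (the distributional choices are arbitrary):

```r
# Monte Carlo check of E[E[Y | X]] = E[Y] for
# X ~ N(0, 1) and Y | X = x ~ N(2x + 1, 1).
set.seed(1)
n <- 1e6
x <- rnorm(n)
y <- rnorm(n, mean = 2 * x + 1, sd = 1)

# Inner expectation, known analytically here: E[Y | X = x] = 2x + 1.
inner <- 2 * x + 1

# Both averages estimate E[Y] = 2 * E[X] + 1 = 1.
mean(inner)
mean(y)
```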


\mathbb{E}[𝟙_Y]=\mathbb{P}(Y)

\bullet \ (\boldsymbol{X},Y) \text{ with joint pdf }f_{\boldsymbol{X},Y}(\boldsymbol{x},y)

\bullet \ \operatorname{g}:\mathbb{R}^d \to \{0, \ldots, c-1\}

\bullet \ \operatorname{Loss}(Y, \operatorname{g}(\boldsymbol{X}))=\mathbb{1}\{Y \neq \operatorname{g}(\boldsymbol{X})\}


\begin{align*} &\text{Claim:} &&\mathbb{E}[𝟙\{Y\neq\operatorname{g}(\boldsymbol{X})\}] &&=\mathbb{P}(Y\neq\operatorname{g}(\boldsymbol{X})). \\[16pt] &\text{Proof:} &&\mathbb{E}[𝟙{\{Y\neq\operatorname{g}(\boldsymbol{X})\}}] &&=\int_{\mathbb{R}^d\times\{0,\ldots,c-1\}}𝟙\{y\neq\operatorname{g}(\boldsymbol{x})\} \ f_{\boldsymbol{X},Y}(\boldsymbol{x},y) \, \mathrm{d}y \, \mathrm{d}\boldsymbol{x} \\[4pt] & && &&=\int_{\mathbb{R}^d} \, \int_{\{0, \ldots, c-1\}} 𝟙 \{y \neq \operatorname{g}(\boldsymbol{x})\} \, f_{\boldsymbol{X},Y}(\boldsymbol{x},y)\, \mathrm{d}y\, \mathrm{d}\boldsymbol{x} \\[4pt] & && &&=\int_{\mathbb{R}^d} \, \int_{\{y \in \{0, \ldots, c-1\}: \, y\neq\operatorname{g}(\boldsymbol{x})\}} \, f_{\boldsymbol{X},Y}(\boldsymbol{x},y)\, \mathrm{d}y\, \mathrm{d}\boldsymbol{x} \\[4pt] & && &&=\int_{\{(\boldsymbol{x},y): \, y\neq\operatorname{g}(\boldsymbol{x})\}} \, f_{\boldsymbol{X},Y}(\boldsymbol{x},y) \, \mathrm{d}y \, \mathrm{d}\boldsymbol{x} \\[16pt] & && &&=\mathbb{P}(Y \neq \operatorname{g}(\boldsymbol{X})) \; &&&\Box \end{align*}


\mathbb{E}[𝟙_Y \vert X]=\mathbb{P}(Y \vert X)

\bullet \ Y \text{ conditional on } \boldsymbol{X}: \; f_{Y \vert \boldsymbol{X}}(y \mid \boldsymbol{x})

\bullet \ \operatorname{g}:\mathbb{R}^d \to \{0, \ldots, c-1\}

\bullet \ \operatorname{Loss}(Y, \operatorname{g}(\boldsymbol{X}))=𝟙\{Y \neq \operatorname{g}(\boldsymbol{X})\}


\begin{align*} &\text{Claim:} &&\mathbb{E}[𝟙\{Y\neq\operatorname{g}(\boldsymbol{X})\} \mid \boldsymbol{X}=\boldsymbol{x}] &&=\mathbb{P}(Y\neq\operatorname{g}(\boldsymbol{X}) \mid \boldsymbol{X}=\boldsymbol{x}). \\[16pt] &\text{Proof:} &&\mathbb{E}[𝟙\{Y\neq\operatorname{g}(\boldsymbol{X})\} \mid \boldsymbol{X}=\boldsymbol{x}] &&=\int_{\{0,\ldots,c-1\}}𝟙\{y\neq\operatorname{g}(\boldsymbol{x})\} \ f_{Y \vert \boldsymbol{X}}(y \mid \boldsymbol{x}) \, \mathrm{d}y \\[4pt] & && &&=\int_{\{y \in \{0, \ldots, c-1\}: \, y\neq\operatorname{g}(\boldsymbol{x})\}} \, f_{Y \vert \boldsymbol{X}}(y \mid \boldsymbol{x})\, \mathrm{d}y \\[16pt] & && &&=\mathbb{P}(Y \neq \operatorname{g}(\boldsymbol{X}) \mid \boldsymbol{X}=\boldsymbol{x}) &&& \Box \end{align*}


Optimal Classifier

\begin{align*} &\text{Claim:} &&\operatorname{g}^*(\boldsymbol{x}) &&= \underset{y \in \{0,\ldots,c-1\}}{\operatorname{arg\,max}}\,\mathbb{P}\big(Y=y \,\mid \boldsymbol{X}=\boldsymbol{x}\big). \\[18pt] &\text{Proof:} &&\operatorname{Loss}(Y, \operatorname{g}(\boldsymbol{X})) &&= 𝟙{\{\,Y \neq \operatorname{g}(\boldsymbol{X})\,\}} \\[4pt] &\Rightarrow && \ell(\operatorname{g}) &&= \mathbb{E}\big[ 𝟙{\{\,Y \neq \operatorname{g}(\boldsymbol{X})\,\}} \big] \\[4pt] & && &&= \mathbb{E} \big[\mathbb{E} \big[𝟙{\{\,Y \neq \operatorname{g}(\boldsymbol{X})\,\}} \mid \boldsymbol{X} \big] \big] \\[4pt] & && &&=\mathbb{E}\big[\mathbb{P}(Y \neq \operatorname{g}(\boldsymbol{X}) \mid \boldsymbol{X})\big] \\[4pt] & \overset{\color{darkgreen}{*}}{\Rightarrow} &&\underset{\scriptscriptstyle y \in \{0,\ldots,c-1\}}{\operatorname{arg\,min}} \ \mathbb{P}(Y \neq y \mid \boldsymbol{X}=\boldsymbol{x}) &&=\underset{\scriptscriptstyle y \in \{0,\ldots,c-1\}}{\operatorname{arg\,min}}\big(1-\mathbb{P}(Y = y \mid \boldsymbol{X}=\boldsymbol{x})\big) \\[4pt] & && &&=\underset{\scriptscriptstyle y \in \{0,\ldots,c-1\}}{\operatorname{arg\,max}} \ \mathbb{P}(Y=y \mid \boldsymbol{X}=\boldsymbol{x}) \\[4pt] & && &&=\operatorname{g}^*(\boldsymbol{x}) &&& \Box \end{align*}

\color{darkgreen}{\text{*}} To minimize the risk \ell(\mathrm{g}), i.e. the expectation of the conditional error probability, it suffices to minimize the conditional probability pointwise, that is, separately for each \boldsymbol{x}.
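The rule can be made concrete with a toy joint distribution; the pmf values below are arbitrary illustrative choices:

```r
# Bayes optimal decision rule on a toy discrete problem:
# x takes values 1..3, y takes values 0/1; joint pmf sums to 1.
joint <- matrix(c(0.10, 0.25,   # x = 1: P(x, y = 0), P(x, y = 1)
                  0.20, 0.05,   # x = 2
                  0.15, 0.25),  # x = 3
                nrow = 3, byrow = TRUE,
                dimnames = list(x = 1:3, y = 0:1))

# Posterior P(y | x): normalize each row by the marginal f(x).
posterior <- joint / rowSums(joint)

# g*(x) = arg max_y P(y | x), taken row by row.
g_star <- colnames(joint)[max.col(posterior)]
g_star  # "1" "0" "1"
```

For each x, the rule simply picks the class with the larger posterior probability in that row.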


Model Types

There are two fundamental modelling approaches to classification. Generative classification aims to model the joint density f_{\boldsymbol{X}, Y}(\boldsymbol{x}, y). From this joint distribution, the conditional density f_{Y \mid \boldsymbol{X}}(y \mid \boldsymbol{x}) can be obtained, which enables classification.

Discriminative models take a different approach. Instead of modelling the full joint distribution, they estimate directly what is needed for classification: either the conditional density f_{Y \mid \boldsymbol{X}}(y \mid \boldsymbol{x}) or only the decision boundary between classes.

Pros and cons

According to Murphy (2012, pp. 268–269), generative and discriminative classifiers have the following advantages and disadvantages:

Advantages of generative models

  • Generative models are often simple to train. For example, naive Bayes requires only counting and averaging. In contrast, logistic regression requires solving a convex optimization problem, which is computationally more demanding.

  • Generative models estimate parameters separately for each class density. Therefore, a new class can be added without retraining the entire model. Discriminative models must be retrained from scratch, since all parameters are interdependent.

  • Because generative models specify a full probability model for \boldsymbol{x}, they can deal with missing data using marginalization. Discriminative classifiers assume all feature components are observed, so handling missing data is non-trivial.

  • Generative models can easily incorporate unlabeled data (semi-supervised learning), whereas discriminative classifiers struggle with this.

  • Generative models are symmetric in inputs and outputs, enabling inference of inputs from outputs. Discriminative models cannot be run in reverse.

Advantages of discriminative models

  • Discriminative models allow arbitrary feature transformations such as basis expansions \phi(\boldsymbol{x}). Generative models have difficulty handling such preprocessed features.

  • Generative models like naive Bayes require strong assumptions that are often violated, leading to overconfident posteriors near 0 and 1. Discriminative models such as logistic regression yield better probability calibration.

Some models listed

A list of example generative and discriminative classifiers is given in Murphy (2012, p. 270).


Classification via Bayes’ Rule

“Following standard practice in Bayesian context, instead of writing f_X(x) and f_{X \vert Y}(x \mid y) for the pdf of X and the conditional pdf of X given Y, one simply writes f(x) and f(x \mid y). If Y is a different random variable, its pdf (at y) is thus denoted by f(y)” (Kroese et al., 2024, p. 48).

For a class-dependent probability, Bayes’ rule starts from the definition of the conditional density: \\[18pt] f(y \mid x)= \frac{f(x,y)}{f(x)}. \\[18pt]

The joint pdf f(x,y) can be expressed as f(x \mid y) \, f(y). Assume \boldsymbol{x} is a feature vector. Since f(\boldsymbol{x}) is constant with respect to y, it follows:

\\[18pt] f(y \mid \boldsymbol{x}) \propto f(\boldsymbol{x} \mid y) \, f(y). \\[18pt]

In Bayesian terminology, f(y \mid \boldsymbol{x}) is called the posterior probability. Further, “f(x \mid y) is the likelihood of obtaining feature vector \boldsymbol{x} of the class y and f(y) is the prior probability of class y” (Kroese et al., 2024, p. 257). If a feature vector \boldsymbol{x} is to be assigned to a class \hat{y}, classification is performed according to the Bayes Optimal Decision Rule:

\\[18pt] \hat{y}=\arg\max_y \, f(y \mid \boldsymbol{x}), \\[18pt]

which is the Bayesian form of the Optimal Classifier derived above: the chosen class \hat{y} maximizes the (unnormalized) posterior probability.


Naive Bayes

As shown above, Bayesian classification exploits the fact that f(x) is constant across all y, so it can be omitted in the decision rule. Because the true density f(y \mid \boldsymbol{x}) is unknown, it is approximated by a function \mathrm{g}(y \mid \boldsymbol{x}) from a specified class of functions \mathcal{G}. The function class \mathcal{G} can then be endowed with distributional assumptions: “In the naïve Bayes method, the class of approximating functions \mathcal{G} is chosen such that \mathrm{g}(\boldsymbol{x} \mid y) \;=\; \mathrm{g}(x_1 \mid y)\, \mathrm{g}(x_2 \mid y)\,\cdots\, \mathrm{g}(x_p \mid y), that is, conditional on the label, all features are independent” (Kroese et al., 2024, p. 258). Assuming a uniform prior over the classes, the factor f(y) is constant across all y and can also be omitted:

\mathrm{g}(y \mid \boldsymbol{x}) \propto \prod_{j=1}^p \mathrm{g}(x_j \mid y). \\[18pt]

Assume the approximating class \mathcal{G} consists of Gaussian densities, i.e. (X_j \mid y) \sim \mathcal{N}(\mu_{yj}, \ \sigma^2) for y=0, \ldots, c-1 and j=1, \ldots, p. The Gaussian pdf is:

\\[18pt] f(x)=\frac{1}{\sigma\sqrt{2 \pi}} \ \exp \left({{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}}}\right), \\[18pt]

Inserting this into the naive Bayes factorization gives the per-feature factor:

\\[18pt] \mathrm{g}(x_j \mid y) = \frac{1}{\sigma\sqrt{2 \pi}} \ \exp \left({ -\frac{1}{2}\frac{(x_j-\mu_{yj})^2}{\sigma^2}} \right). \\[18pt]

The scaling factor \frac{1}{\sqrt{2 \pi \sigma^2}} does not depend on y and therefore drops out in the decision rule. Thus, classification is based on comparing the unnormalized posteriors:

\\[18pt] \mathrm{g}(y \mid \boldsymbol{\theta}, \ \boldsymbol{x}) \propto \prod_{j=1}^p \exp \left(-\frac{1}{2}\frac{(x_j-\mu_{yj})^2}{\sigma^2} \right), \\[18pt]

which is:

\\[18pt] \mathrm{g}(y \mid \boldsymbol{\theta}, \ \boldsymbol{x}) \propto \exp \left(-\frac{1}{2}\sum_{j=1}^p \frac{(x_j-\mu_{yj})^2}{\sigma^2} \right). \\[18pt]

The sum is a squared Euclidean distance:

\\[18pt] \sum_{j=1}^p(x_j-\mu_{yj})^2= \|\boldsymbol{x}-\boldsymbol{\mu}_y\|^2, \\[18pt]

thus:

\\[18pt] \mathrm{g}(y \mid \boldsymbol{\theta}, \ \boldsymbol{x}) \propto \exp \left(-\frac{\|\boldsymbol{x}-\boldsymbol{\mu}_y\|^2}{2\sigma^2} \right), \\[18pt]

where \boldsymbol{\theta} contains the parameters \mu and \sigma, which have to be estimated from the data. “The probability \mathrm{g}(y \mid \boldsymbol{\theta}, \ \boldsymbol{x}) is maximal, when \|x-\mu_y\| is minimal. Thus, \hat{y}=\mathrm{arg \ min}_y\ \|\boldsymbol{x}-\boldsymbol{\mu}_y\| is the classifier. That is, classify \boldsymbol{x} as y when \boldsymbol{\mu}_y is closest to \boldsymbol{x} in Euclidean distance” (Kroese et al., 2024, p. 258).
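Under these assumptions (equal priors, shared variance), the classifier is a nearest-mean rule. A minimal R sketch with illustrative class means:

```r
# Nearest-mean classification: assign x to the class whose mean
# vector is closest in Euclidean distance. The two class means
# below are illustrative choices.
mu <- rbind(class0 = c(0, 0),
            class1 = c(3, 3))

classify_nearest_mean <- function(x, mu) {
  # squared Euclidean distance of x to each class mean (row of mu)
  d2 <- rowSums((mu - matrix(x, nrow(mu), length(x), byrow = TRUE))^2)
  rownames(mu)[which.min(d2)]
}

classify_nearest_mean(c(0.5, 1.0), mu)  # "class0"
classify_nearest_mean(c(2.5, 2.0), mu)  # "class1"
```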


Application

Implement Gaussian Naive Bayes in R without using any built-in model implementations.

# data handling I

# data are in wide format,
# one row = one image,
# first column contains the labels,
# labels are integers in {0,...,9},
# each image has 28x28 pixels,
# pixels are integers in {0,...,255}.
# (assumes the data frames `train` and `test` have
# already been loaded, e.g. via read.csv)

# LABELS
# select first column, save as vector.
y_train <- as.factor(train[ ,1])
y_test  <- as.factor(test[ ,1])

# PIXELS
# select all rows and columns, 
# except first column, save as matrix.
X_train <- as.matrix(train[ ,-1])
X_test  <- as.matrix(test[ ,-1])
# fitting model / compute parameters

fit_GNB <- function(X, y, eps = 1e-6){
  
  X   <- as.matrix(X)
  y   <- as.factor(y)
  
  n_k <- as.numeric(table(y))               # class counts
  
  priors  <- table(y)/length(y)             # relative class frequencies
  means   <- rowsum(X, y) / n_k             # class-wise feature means
  vars    <- rowsum(X^2, y) / n_k - means^2 # MLE variances: E[X^2] - E[X]^2
  
  vars[vars < eps] <- eps                   # floor variances to avoid division by ~0
  
  list(priors = priors,
       means  = means,
       vars   = vars)
}


In Fashion-MNIST, Gaussian naive Bayes estimates one variance for each class-pixel combination, resulting in 10 × 784 distinct variances. Analogously to the previous section, the factor \frac{1}{\sqrt{2\pi}} can be omitted because it does not depend on y. Thus, the class-conditional likelihood for pixel j in class y is

\\[18pt] \mathrm{g}(x_j \mid y) = \frac{1}{\sigma_{yj}} \ \exp \left({ -\frac{(x_j-\mu_{yj})^2}{2\sigma_{yj}^2}} \right). \\[18pt]

The corresponding posterior is

\mathrm{g}(y \mid \boldsymbol{\theta}, \ \boldsymbol{x}) \propto \mathrm{g}(y) \prod_{j=1}^{784} \ \left( \frac{1}{\sigma_{yj}} \ \exp \left(-\frac{(x_j-\mu_{yj})^2}{2\sigma_{yj}^2} \right) \right). \\[18pt]

To avoid numerical underflow (Murphy, 2012, p. 86), the log is taken:

\\[18pt] \log \, \mathrm{g}(y \mid \boldsymbol{\theta}, \ \boldsymbol{x}) \propto \log\mathrm{g}(y) + \sum_{j=1}^{784} \left( \, \log \left( \frac{1}{\sigma_{yj}}\right) \ -\frac{(x_j-\mu_{yj})^2}{2\sigma_{yj}^2} \right) \\[18pt] = \log\mathrm{g}(y) + \sum_{j=1}^{784} \, \left( -\log \left(\sigma_{yj} \right) -\frac{(x_j-\mu_{yj})^2}{2\sigma_{yj}^2} \right). \\[18pt]
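The need for the log transform is easy to demonstrate in R; the values below are illustrative:

```r
# Products of many small per-pixel likelihood values underflow to 0
# in double precision, while the equivalent sum of logs stays finite.
p <- rep(1e-5, 784)   # 784 small likelihood values (illustrative)

prod(p)       # underflows to 0
sum(log(p))   # finite, still usable for comparison across classes
```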

Implementation in R:

# prediction function

pred_GNB <- function(model, X){
  
  X <- as.matrix(X)
  n <- nrow(X)
  
  classes <- names(model$priors)
  K       <- length(classes)
  
  log_post <- matrix(NA, n, K)
  
  for (k in seq_len(K)){
    mu <- model$means[k, ]
    va <- model$vars[k, ]
    
    # Gaussian log-likelihood per observation, summed over pixels
    # (additive constants that do not depend on the class are dropped)
    ll <- rowSums(-0.5 * log(va) - (sweep(X, 2, mu, "-")^2) / (2 * va))
    log_post[ , k] <- ll + log(as.numeric(model$priors[k]))
  }
  
  # assign each observation to the class with the largest log-posterior
  idx <- max.col(log_post, ties.method = "first")
  factor(classes[idx], levels = classes)
}
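Before running on Fashion-MNIST, the approach can be checked end to end on synthetic data. The snippet below inlines the same computations as the two functions above so that it runs standalone; the two-class Gaussian sample is an illustrative choice:

```r
# End-to-end check on synthetic data: two well-separated Gaussian
# classes should be classified almost perfectly.
set.seed(42)
X <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 5), ncol = 2))
y <- factor(rep(c("a", "b"), each = 100))

# parameter estimates (as in fit_GNB)
priors <- table(y) / length(y)
means  <- rowsum(X, y) / as.numeric(table(y))
vars   <- rowsum(X^2, y) / as.numeric(table(y)) - means^2
vars[vars < 1e-6] <- 1e-6

# log-posteriors and prediction (as in pred_GNB)
log_post <- sapply(levels(y), function(k) {
  mu <- means[k, ]; va <- vars[k, ]
  rowSums(-0.5 * log(va) - sweep(X, 2, mu, "-")^2 / (2 * va)) +
    log(as.numeric(priors[k]))
})
pred <- factor(levels(y)[max.col(log_post)], levels = levels(y))

mean(pred == y)  # close to 1 on this separable toy data
```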

\text{accuracy}_{y}=\frac{tp_y+tn_y}{n}

(Kroese et al., 2024, p. 254)

# accuracy

model_GNB   <- fit_GNB(X_train, y_train)
pred_class  <- pred_GNB(model_GNB, X_test)

mean(pred_class == y_test)
[1] 0.1161
# accuracy per class
# (computed as per-class recall: the fraction of
# images with true class cl that are predicted as cl)

classes <- levels(y_test)
accuracy_pc <- sapply(classes, function(cl){
  idx <- which(y_test == cl)
  mean(pred_class[idx] == cl)
})

accuracy_pc
    0     1     2     3     4     5     6     7     8     9 
0.082 0.000 0.000 0.017 0.000 0.000 0.078 0.000 0.938 0.046 
# plot first image.

plot_image <- function(row){
  
  # reshape the 784 pixel values into a 28x28 matrix
  pxmat <- matrix(as.numeric(row), nrow = 28, byrow = TRUE)
  # rotate so the image is displayed upright
  pxmat <- t(apply(pxmat, 2, rev))
  
  image(pxmat, col = gray.colors(255), axes = FALSE, asp = 1)
}

plot_image(train[1, -1])

# check the label:
train[1, 1]
[1] 2
library(tidyverse)
library(viridis)
# pixel values are not independent 

# plot heatmaps

plot_avg_im <- function(row){
  
  pxmat <- matrix(as.numeric(row), nrow = 28, byrow = TRUE)
  pxmat <- t(apply(pxmat, 2, rev))
  
  image(pxmat, col = viridis(255), axes = TRUE, asp = 1)
}

# heatmap pullover
avg_pul     <- train %>% filter(label == 2) %>% select(-label) %>% colMeans()
avg_pul_df  <- as.data.frame(t(avg_pul))
plot_avg_im(avg_pul_df) 

# heatmap bag
#avg_bag     <- train %>% filter(label == 8) %>% select(-label) %>% colMeans()
#avg_bag_df  <- as.data.frame(t(avg_bag))
#plot_avg_im(avg_bag_df)
# pixel values are not Gaussian distributed

# qq-plot
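A sketch for the missing QQ-plot. So that the snippet runs standalone, it uses simulated pixel-like data (a zero-inflated, bounded integer variable) as a stand-in; on the real data one would pass a single pixel column within one class, e.g. X_train[y_train == "2", 380] (the pixel index here is an arbitrary choice):

```r
# QQ-plot: pixel intensities are bounded in {0,...,255} and often
# piled up at 0, so they cannot be normally distributed. The
# simulated stand-in below shows the typical departure from the
# normal reference line (a flat run at 0 and truncated tails).
set.seed(1)
px <- ifelse(runif(1000) < 0.4, 0,
             sample(50:255, 1000, replace = TRUE))

qqnorm(px, main = "Simulated pixel values")
qqline(px)
```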




Exercises

1. Conceptual Questions

a) Distinguish between probability and likelihood.

b) What is the core difference between generative and discriminative models?

2. Bayes’ Rule

Given the expression:

f(y \mid \boldsymbol{x}) \propto f(\boldsymbol{x} \mid y) \, f(y)

explain the meaning of

  • the likelihood term

  • the prior

  • the posterior

3. Independence Assumption

Naive Bayes assumes:

\mathrm{g}(\boldsymbol{x} \mid y) = \prod_{j=1}^p \mathrm{g}(x_j \mid y)

Explain what this assumption means.


Literature

Kroese, D. P., Botev, Z. I., Taimre, T., & Vaisman, R. (2024). Data Science and Machine Learning: Mathematical and Statistical Methods. Chapman and Hall/CRC.

Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.




Jascha Kottner | 2025